Lab Assignment One: Exploring Table Data

CS 5324

2021-02-14

Anthony Wang

Business Understanding

This dataset contains information about United States on-duty firefighter fatalities that have occurred since January 1, 2000. It is aggregated by the United States Fire Administration primarily via notification from individual fire departments, but other sources include the U.S. Department of Justice, the National Institute for Occupational Safety and Health, the Occupational Safety and Health Administration, the U.S. Department of Defense, and the National Interagency Fire Center. This data is collected to find and evaluate methods of reducing firefighter deaths.

This analysis focuses on predicting the nature and cause of death given various combinations of the age, classification, date of incident, date of death, and activity of the firefighter as well as whether the firefighter was on emergency duty. Such information would show how causes of fatalities have changed in more modern times compared to the beginning of the century. The USFA and individual fire departments could identify what conditions have seen a reduction in fatalities and subsequently evaluate the efficacy of solutions to preserve firefighter safety. Alternatively, certain causes of fatalities may be on the rise in recent years, and new solutions could be devised to save firefighter lives.

No prior predictions of this nature could be found, so there is no precedent against which accuracy can be compared. Therefore, this analysis operates with the belief that a prediction accuracy over 50% for the nature of fatality and cause of fatality would be useful to departments and agencies.

Data Understanding

The summary is difficult to parse meaningfully and incredibly dependent on individual circumstances. The name features are not relevant to the prediction task. The Memorial fund information suffers from the previous problems in addition to having too many null values. Rank appears to be a valuable ordinal feature, but because there is no single standard system of rank in use, there are many unique values which occur very few times. Additionally, due to the many firefighter rank systems, it is difficult to determine whether a rank in one system outranks a rank in another system, making ordinality near impossible to define. For these reasons the mentioned features will be excluded from analysis.

The remaining features need datatype tweaking before use. The dates need conversion from string to datetime and the binary Emergency duty feature would be better suited as a boolean.
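The conversions described above might look like the following in pandas. This is a minimal sketch on toy rows: the column names and the "Yes"/"No" encoding of the emergency feature are assumptions made for illustration, not values confirmed from the dataset.

```python
import pandas as pd

# Toy rows standing in for the real table. Column names and the
# "Yes"/"No" encoding are assumptions, not taken from the dataset.
df = pd.DataFrame({
    "Date of incident": ["2000-01-02", "2001-09-11", "not recorded"],
    "Date of death": ["2000-01-02", "2001-09-11", "2005-03-04"],
    "Emergency": ["Yes", "Yes", "No"],
})

# Parse the date strings; entries that cannot be parsed become NaT
# instead of raising an error.
for col in ["Date of incident", "Date of death"]:
    df[col] = pd.to_datetime(df[col], errors="coerce")

# Map the two-valued string feature onto a nullable boolean column.
df["Emergency"] = (
    df["Emergency"].map({"Yes": True, "No": False}).astype("boolean")
)

print(df.dtypes)
```

Using the nullable "boolean" dtype rather than plain bool keeps any unmapped value as missing instead of silently turning it into True.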

A data description table is useful for presenting a general overview of features.
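One way such a table could be assembled is sketched below. The helper name describe_features and the toy columns are hypothetical, chosen only to show the shape of the output.

```python
import numpy as np
import pandas as pd

def describe_features(df):
    """Summarize each column: dtype, non-null count, null count, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "nulls": df.isna().sum(),
        "unique": df.nunique(),  # distinct non-null values
    })

# Toy rows; the column names are assumptions for illustration.
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 45.0, 45.0],
    "Classification": ["Career", "Volunteer", "Career", "Career"],
})

description = describe_features(toy)
print(description)
```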

There are duplicated instances in the features. These are unique real-world firefighter fatalities that had circumstances recorded identically by the USFA.
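A check of this kind can be sketched with pandas as below. The rows and column names are illustrative assumptions; the snippet only flags repeated rows and does not drop them.

```python
import pandas as pd

# Toy rows; two distinct fatalities happen to share identical recorded values.
df = pd.DataFrame({
    "Age": [34, 34, 51],
    "Cause of death": ["Collapse", "Collapse", "Stress/Overexertion"],
    "Nature of death": ["Trauma", "Trauma", "Heart Attack"],
})

# keep=False flags every member of a duplicated group, not only the repeats.
dup_mask = df.duplicated(keep=False)
n_duplicated = int(dup_mask.sum())

print(df[dup_mask])
```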

Many of the features possess null values. For the Age and Property type features, k-nearest-neighbor would be a good choice for imputation because the curation of the USFA ensures data is properly labeled and free of noise. The dataset is also small enough to calculate the nearest neighbors in a timely fashion.

For the other features containing null values, their instances comprise less than 2% of the dataset and can be eliminated without much consequence.
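The elimination step could be expressed roughly as follows. The helper drop_sparse_nulls, its threshold argument, and the toy columns are hypothetical; only the 2% figure comes from the text above.

```python
import numpy as np
import pandas as pd

def drop_sparse_nulls(df, threshold=0.02):
    """Drop rows that are null in any column whose null fraction is below threshold."""
    null_fraction = df.isna().mean()
    sparse_cols = null_fraction[(null_fraction > 0) & (null_fraction < threshold)].index
    if len(sparse_cols) == 0:
        return df.copy()
    # Columns with many nulls (e.g. Age) are left alone here.
    return df.dropna(subset=list(sparse_cols))

# Toy table: 100 rows, 1% null in "Activity", 30% null in "Age".
activity = ["Advance Hose Lines"] * 100
activity[0] = None
age = [40.0] * 100
age[50:80] = [np.nan] * 30
toy = pd.DataFrame({"Activity": activity, "Age": age})

cleaned = drop_sparse_nulls(toy)
print(len(toy), "->", len(cleaned))
```

Only the row with the missing Activity value is removed; the rows with a missing Age survive because that column exceeds the threshold.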

After elimination, this analysis will choose to simply use the dataset as is for plotting and analysis because of its author's lack of expertise in generating a distance function from nominal and datetime features. Unfortunately, this results in losing out on large portions of the dataset when the Age and Property type features are used, which is far from ideal. Should further or future analyses occur outside the scope of this lab, an appropriate distance function and imputation of Age and Property type would be a valuable addition.
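For reference, a future imputation of Age alone might be sketched with scikit-learn's KNNImputer, which handles numeric features only. This is not what the analysis does: the numeric stand-in columns ("Incident year" and a 0/1 "Emergency" flag) are assumptions, and nominal features such as Property type would still need an encoding or a custom distance.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric table; column names are assumptions for illustration.
df = pd.DataFrame({
    "Age": [25.0, 27.0, np.nan, 58.0, 60.0],
    "Incident year": [2001, 2002, 2002, 2018, 2019],
    "Emergency": [1, 1, 1, 0, 0],
})

# Each missing Age is replaced by the mean Age of the 2 nearest rows,
# with distance measured on the features that are present.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed)
```

The features are left unscaled here, so the year dominates the distance; a real application would likely standardize them first.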

Data Visualization

The split violin plot of firefighter fatality ages poses questions about which firefighters are sent to which situations. The immediate implication of the more uniform fatality age distribution for emergency responses is that emergency situations warrant an all-hands-on-deck scenario. Such a scenario would require firefighters to attend the situation regardless of their age, exposing them to the risk of fatality to similar degrees.

But why is there a more pronounced peak above the average age for fatalities during non-emergency situations? A possible set of answers includes older firefighters below the age of retirement having greater say in what assignments or calls they take. Seniority could permit them to take shifts that have more house calls than emergencies, and it's during these non-emergencies that they suffer the fatalities plotted.

At first glance it is also peculiar for firefighters to suffer nearly double the fatalities when responding to non-emergencies compared to emergencies. Further contemplation arrives at the thought that non-emergency dispatches must far outnumber emergency dispatches. Depending on what that factor is, emergency fatalities being as many as half of non-emergency fatalities could be a testament to the danger associated with emergency dispatches.

The minor dip during 2020 may be attributable to the coronavirus pandemic and subsequent lockdown/quarantine. Reduced human activity may have resulted in fewer situations warranting dispatches, which in turn results in fewer firefighters entering life-threatening situations.

There is a massive spike in firefighter deaths in the year 2001. This is merely a casual assumption and certainly not intended to be presented as a fact, but the September 11 attacks may be a plausible explanation. If that is the case, then the data points accumulated into the bar are certainly legitimate. But due to the extraordinary circumstances (i.e. the deadliest terrorist attack in human history) that led to the fatalities, should these be analyzed separately from the rest of the fatality data? The 2001 data almost certainly skews whatever trends would be drawn from the dataset as a whole.

The spike prompts a second question in relation to the null age values. There were many unidentified victims in the September 11 attacks, and many of them could have been firefighters. What is the effect of removing the instances containing null age values from the dataset and then graphing this same plot? Would the spike be reduced at all? The following plot suggests the answer is yes.
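The comparison behind that question can be sketched as yearly counts computed with and without the rows that lack an age. The toy rows and column names are assumptions, as is the choice to group by date of death rather than date of incident.

```python
import numpy as np
import pandas as pd

# Toy rows; values are illustrative only, not drawn from the dataset.
df = pd.DataFrame({
    "Date of death": pd.to_datetime(
        ["2000-05-01", "2001-09-11", "2001-09-11", "2001-09-11", "2002-03-03"]
    ),
    "Age": [40.0, np.nan, np.nan, 35.0, 52.0],
})

year = df["Date of death"].dt.year

# Fatalities per year for all rows, and for rows with a recorded age only.
all_counts = year.value_counts().sort_index()
known_age_counts = year[df["Age"].notna()].value_counts().sort_index()

comparison = (
    pd.DataFrame({"all": all_counts, "age known": known_age_counts})
    .fillna(0)
    .astype(int)
)
print(comparison)
```

Calling comparison.plot.bar() would draw the two series side by side per year, which makes any reduction in the spike directly visible.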

It was not surprising to see trauma far outnumber other natures of firefighter deaths. But it was incredibly surprising to see that it still falls short of heart attacks. The large number of heart attack deaths prompted investigation into what the cause may be. The most obvious option is to plot a two-dimensional histogram which counts the number of instances sharing a certain cause of death and nature of death.
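The counts underlying such a two-dimensional histogram can be sketched with pandas' crosstab. The rows below are toy values and the column names are assumptions for illustration.

```python
import pandas as pd

# Toy rows; category labels are illustrative only.
df = pd.DataFrame({
    "Cause of death": [
        "Stress/Overexertion", "Stress/Overexertion", "Collapse",
        "Caught/Trapped", "Stress/Overexertion",
    ],
    "Nature of death": [
        "Heart Attack", "CVA", "Trauma", "Asphyxiation", "Heart Attack",
    ],
})

# Rows are causes, columns are natures, cells count co-occurring instances.
counts = pd.crosstab(df["Cause of death"], df["Nature of death"])
print(counts)
```

Each cell of this table corresponds to one bin of the histogram, so it can be passed to a heatmap routine for plotting.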

And thus a possible explanation arises. Deaths caused by stress or overexertion are numerous, and their nature is frequently a heart attack or cerebrovascular accident (i.e. stroke). This plot also uncovers relations from collapse, collisions, striking, and falls to trauma. Another relation appears between a firefighter being caught or trapped and their death from asphyxiation or burns.

UMAP Dimensionality Reduction

None of the UMAP-reduced plots contain identifiable clusters of the same target value. This implies it is difficult to reduce the dimensionality of the four fitted features without unrecognizably changing qualities of the raw data. No amount of tweaking of the n_neighbors and min_dist parameters proved fruitful. Such an outcome is likely due to the lack of relation both within the features which were fitted and between the features and targets. There is little to be analytically gained from a UMAP dimensionality reduction.